Wine Quality Analysis: Identifying Key Quality Predictors

University of Sydney ODAT5011 Project 2

Author

James Tesoriero

Published

April 13, 2025

Project Introduction

The focus of this project is on evaluating and predicting wine quality by leveraging both structured and unstructured data sources.

Drawing on datasets from the UCI Machine Learning Repository (Dua & Graff, 2017) and Kaggle’s WineMag critic reviews (Zynicide, 2017), the analysis integrates machine learning models, statistical inference, and natural language processing (NLP) to uncover patterns that influence wine quality ratings. By combining physicochemical measurements with qualitative descriptions, the report aims to provide a multi-dimensional perspective on what constitutes a high-quality wine.

The intended audience for this report includes not only academics but also potential industry stakeholders such as Treasury Wine Estates (TWE), a global wine company committed to data-informed decision-making in viticulture, marketing, and consumer experience.


Industry Context: Treasury Wine Estates (TWE)

Treasury Wine Estates (TWE) is one of the world’s leading wine producers, with a portfolio that includes iconic brands such as Penfolds, Wolf Blass, and 19 Crimes. Known for its global footprint and premium positioning, TWE has demonstrated a strategic focus on innovation and analytics in both vineyard operations and market engagement (Treasury Wine Estates, 2024).

Why This Project Aligns with TWE’s Strategic Vision

  • Premium Brand Positioning:
    This analysis supports TWE’s emphasis on product quality and brand differentiation by identifying key chemical and sensory factors linked to critic-rated wine excellence.

  • Market Intelligence Through Data:
    By combining structured data (e.g. alcohol, pH, sulphates) with unstructured critic reviews, the report exemplifies a modern, data-driven approach to understanding wine attributes and consumer perceptions.

  • Applied Innovation:
    The use of machine learning and natural language processing (NLP) techniques mirrors TWE’s growing interest in leveraging digital tools for product development and personalised marketing.

  • Global Relevance and Scalability:
    While based on Portuguese wine data, the findings and modelling framework are transferable to TWE’s diverse international operations.

This project demonstrates how data analytics can be used to support both academic research and practical decision-making within the wine industry.


Executive Summary

This report investigates the key predictors of wine quality using two complementary datasets:

  1. UCI Wine Quality Dataset – containing physicochemical attributes and quality ratings for red and white Vinho Verde wines from Portugal.
  2. WineMag (Kaggle) Reviews – featuring over 5,000 expert reviews of Portuguese wines, including textual descriptions, prices, and point-based ratings (review scores).

Key Analytical Outcomes

  • Predictive Modelling of Wine Quality:
    Regression-based and tree-based machine learning models (including random forest and ordinal forest) were developed to assess which chemical attributes most strongly predict expert quality ratings.

  • Natural Language Processing (NLP) on Wine Descriptions:
    A text mining approach was used to analyse wine reviews, with high-scoring wines (above 90 points) revealing distinct descriptive patterns. Techniques included word clouds and logistic regression for term importance.

  • Statistical Exploration:
    The report includes hypothesis testing (including t-tests) and confidence intervals to compare red and white wines, as well as analyses of pricing distributions and review score variability.

  • Interactive Visualisation (Shiny App):
    A web-based dashboard was developed to allow users to explore and predict wine quality interactively using model inputs and review filters.


Data Summary

Dataset 1: UCI Wine Quality (Vinho Verde)

  • Observations: 6,497 (Red: 1,599 | White: 4,898)
  • Variables: 11 numeric predictors (e.g. alcohol, sulphates, density) + expert-rated quality score
  • Source: UCI Machine Learning Repository

Dataset 2: WineMag (Kaggle) Portuguese Wine Reviews

  • Observations: 5,322
  • Variables: Country, variety, wine colour, review score, price, critic review (text)
  • Source: Kaggle (Zynicide / Wine Reviews)

Both datasets were pre-processed to address missing data, remove outliers (based on price percentiles), and harmonise features across data types. Additional transformations (e.g., log(price)) and categorical encodings were implemented to support modelling and visualisation.

Limitations

A key limitation of this project is the inability to directly link the two datasets — the UCI Wine Quality data and the WineMag critic reviews — at the individual wine level. Due to the absence of a shared unique identifier (such as a wine name, vineyard, or product code), it was not possible to match specific wines across the datasets to connect their physicochemical properties, expert quality scores, critic reviews, and pricing information. As a result, the analysis focuses on broader trends and comparisons between red and white wines, rather than wine-specific modelling. While both datasets offer valuable insights independently, a more granular, linked dataset would enable a deeper exploration of the relationships between chemical composition, price, and perceived quality.


Code
# ----------------------------------------------------------
# Load Required Libraries
# ----------------------------------------------------------

# Libraries for data manipulation, modeling, and visualization
library(tidyverse)       # Core data manipulation and visualisation tools 
library(corrplot)        # Visualizing correlation matrices
library(randomForest)    # Traditional Random Forest modeling
library(broom)           # Tidying model outputs into tibbles
library(tm)              # Text mining framework (cleaning and tokenising)
library(wordcloud)       # Generating word cloud visualizations
library(RColorBrewer)    # Color palettes for better plot aesthetics
library(gridExtra)       # Arranging multiple ggplot objects in a grid
library(ordinalForest)   # Ordinal random forest modeling for ordered outcomes
library(MASS)            # Classic stats functions; loaded after tidyverse, so it masks dplyr::select()
library(irr)             # Inter-rater reliability metrics (e.g., Cohen’s Kappa)
library(caret)           # Classification and Regression Training framework
library(ggplot2)         # Grammar of graphics visualisation package 
library(stringr)         # String manipulation utilities
library(ranger)          # Fast implementation of Random Forests
library(readr)           # Fast and tidy reading of CSV and text files
library(Polychrome)      # Load Polychrome for generating visually distinct colour palettes

# Set seed for reproducibility
set.seed(123)  # This initial seed is sufficient for reproducibility
Code
# ----------------------------------------------------------
# Load and Combine UCI Wine Quality Datasets
# ----------------------------------------------------------


wine_red <- read_delim("../data/winequality-red.csv", delim = ";")
wine_white <- read_delim("../data/winequality-white.csv", delim = ";")

wine_red$wine_colour <- "Red"
wine_white$wine_colour <- "White"

# Combine red and white datasets into a unified dataframe and clean column names
vinho_verde_data <- bind_rows(wine_red, wine_white) %>%
  rename_with(~ gsub(" ", "_", .))
Code
# ----------------------------------------------------------
# Load Wine Reviews Dataset
# ----------------------------------------------------------
wine_reviews <- read_delim("../data/winemag.csv")

# ----------------------------------------------------------
# Provide logic for wine variety colours and Filter for Portugal
# ----------------------------------------------------------

# Define red and white varieties
red_varieties <- c(
  "Cabernet Sauvignon", "Merlot", "Pinot Noir", "Syrah", "Malbec", "Tempranillo", 
  "Sangiovese", "Zinfandel", "Grenache", "Mourvèdre", "Touriga Nacional", "Baga", 
  "Aragonez", "Tinta Roriz", "Touriga Franca", "Bobal", "Alfrocheiro", "Vinhão", 
  "Argaman", "Graciano", "Garnacha Tintorera", "Tinta Amarela", "Alicante Bouschet", 
  "Aragonês", "Baga-Touriga Nacional", "Bastardo", "Bordeaux-style Red Blend", 
  "Cabernet Sauvignon and Tinta Roriz", "Cabernet Sauvignon-Syrah", "Castelão", 
  "Espadeiro", "Jaen", "Madeira Blend", "Merlot-Syrah", "Moscatel Roxo", "Petit Verdot", 
  "Petite Verdot", "Port", "Portuguese Red", "Portuguese Rosé", "Red Blend", 
  "Rhône-style Red Blend", "Rosé", "Shiraz", "Sousão", "Tinta Barroca", "Tinta Francisca", 
  "Tinta Negra Mole", "Touriga Nacional Blend", "Touriga Nacional-Cabernet Sauvignon", 
  "Trincadeira"
)
white_varieties <- c(
  "Chardonnay", "Sauvignon Blanc", "Riesling", "Pinot Grigio", "Viognier", "Gewürztraminer", 
  "Alvarinho", "Encruzado", "Arinto", "Antão Vaz", "Loureiro", "Fernão Pires", "Albana", 
  "Alvarinho-Chardonnay", "Avesso", "Azal", "Bical", "Bual", "Côdega do Larinho", "Cerceal", 
  "Chenin Blanc", "Gewürztraminer-Riesling", "Gouveio", "Malmsey", "Malvasia", 
  "Malvasia Fina", "Moscatel", "Moscatel Graúdo", "Muscat", "Pinot Blanc", 
  "Portuguese Sparkling", "Portuguese White", "Rabigato", "Sémillon", "Sercial", "Siria", 
  "Sparkling Blend", "Verdelho", "White Blend", "White Port", "Códega do Larinho"
)



# Filter for Portuguese wines and classify colour based on variety
wine_reviews_portugal_clean <- wine_reviews %>% 
  filter(country == "Portugal") %>%
  mutate(wine_colour = case_when(
    variety %in% red_varieties ~ "Red",
    variety %in% white_varieties ~ "White",
    TRUE ~ "Unknown"
  )) %>%
  rename(score = points, price_USD = price) %>%
  dplyr::select(country, variety, wine_colour, score, price_USD, description) %>%
  filter(!is.na(score), !is.na(price_USD))
Code
# ----------------------------------------------------------
# Correlation between Price(USD) and Review Score (Raw and Log)
# ----------------------------------------------------------

# Create a new column with log-transformed price to normalise distribution
wine_reviews_portugal_clean$log_price <- log(wine_reviews_portugal_clean$price_USD)

# ----------------------------------------------------------
# Raw Price Correlation with Score
# ----------------------------------------------------------

# Calculate Pearson correlation coefficient between raw price and score
corr_coef_raw <- cor(wine_reviews_portugal_clean$price_USD, wine_reviews_portugal_clean$score, use = "complete.obs")
#cat("Correlation coefficient between price_USD and score:", round(corr_coef_raw, 3), "\n")

# Perform hypothesis test for correlation between raw price and score
corr_test_raw <- cor.test(wine_reviews_portugal_clean$price_USD, wine_reviews_portugal_clean$score)
#print(corr_test_raw)

# ----------------------------------------------------------
# Log-Transformed Price Correlation with Score
# ----------------------------------------------------------

# Calculate Pearson correlation coefficient between log(price) and score
corr_coef_log <- cor(wine_reviews_portugal_clean$log_price, wine_reviews_portugal_clean$score, use = "complete.obs")
#cat("Correlation coefficient between log(price_USD) and score:", round(corr_coef_log, 3), "\n")

# Perform hypothesis test for correlation between log-transformed price and score
corr_test_log <- cor.test(wine_reviews_portugal_clean$log_price, wine_reviews_portugal_clean$score)
#print(corr_test_log)
Code
# ----------------------------------------------------------
# Define Function to Generate Word Cloud from Wine Descriptions
# ----------------------------------------------------------

create_wordcloud <- function(text_data, caption = NULL, scale_range = c(3, 0.5), caption_size = 1.2) {
  layout(matrix(1:2, ncol = 1), heights = c(0.2, 0.8))  # Caption (20%) + Wordcloud (80%)
  
  # --- Row 1: Caption ---
  par(mar = rep(0, 4))
  plot.new()
  if (!is.null(caption)) {
    text(x = 0.5, y = 0.5, labels = caption, cex = caption_size, font = 2)
  }
  
  # --- Row 2: Word Cloud ---
  par(mar = c(0, 0, 0, 0))
  
  # Clean and process the text data
  corpus <- Corpus(VectorSource(text_data))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "wine"))
  
  # Compute word frequencies
  tdm <- TermDocumentMatrix(corpus)
  m <- as.matrix(tdm)
  word_freq <- sort(rowSums(m), decreasing = TRUE)
  word_df <- data.frame(word = names(word_freq), freq = word_freq)
  
  # Generate the word cloud
  wordcloud(
    words = word_df$word, freq = word_df$freq, min.freq = 5,
    max.words = 200, random.order = FALSE, rot.per = 0.35,
    scale = scale_range, colors = brewer.pal(8, "Dark2")
  )
  
  # Reset layout
  layout(1)
}

Initial Analysis of Wine Reviews and Quality

Language of High-Scoring Wine Reviews

The word cloud below was generated from critic reviews of Portuguese wines that received a score above 90 points. This visualisation highlights the most frequently used descriptive terms, with more commonly occurring words displayed in larger font sizes and positioned closer to the centre of the cloud. The representation accounts for repeated word usage across different reviews, providing a sense of overall emphasis in the language used to describe high-quality wines.

Code
# ----------------------------------------------------------
# Generate Word Cloud for Red Wines with Score > 90
# ----------------------------------------------------------

red_descriptions <- wine_reviews_portugal_clean %>%
  filter(wine_colour == "Red", score > 90) %>%
  pull(description)

#create_wordcloud(red_descriptions, caption = "Red Wines (Score > 90)", scale_range = c(3, 0.5))

# ----------------------------------------------------------
# Generate Word Cloud for White Wines with Score > 90
# ----------------------------------------------------------

white_descriptions <- wine_reviews_portugal_clean %>%
  filter(wine_colour == "White", score > 90) %>%
  pull(description)

#create_wordcloud(white_descriptions, caption = "White Wines (Score > 90)", scale_range = c(3, 0.5))


# ----------------------------------------------------------
# Generate Combined Word Cloud for All Wines with Score > 90
# ----------------------------------------------------------

top_descriptions <- wine_reviews_portugal_clean %>%
  filter(score > 90) %>%
  pull(description)

create_wordcloud(
  top_descriptions,
  caption = "Most Popular Words of Highly Scored Wines (Score > 90)",
  scale_range = c(3, 0.5),
  caption_size = 1  # Decrease font size 
)

The prominence of terms like tannin, fruit, flavours, acidity, and rich in high-scoring wine reviews reflects their significance in defining wine quality and complexity. These descriptors are commonly used by wine critics to articulate the sensory experiences associated with premium wines (Jackson, 2020).

  • Tannin: Tannins contribute to the structure and aging potential of red wines. They are responsible for the drying sensation on the palate and add complexity to the wine’s profile. High-quality red wines often exhibit well-integrated tannins that enhance their overall balance and longevity.

  • Fruit: Fruit descriptors indicate the presence of primary aromas and flavours derived from the grapes. The expression of fruit notes, such as berry, cherry, or citrus, signifies the wine’s varietal character and ripeness, which are key indicators of quality.

  • Flavours: The term flavours encompasses the range of taste sensations perceived in wine, including fruit, spice, earth, and oak-derived notes. A complex flavour profile is often associated with higher-quality wines, as it reflects depth and nuance.

  • Acidity: Acidity provides freshness and balance to wine, influencing its liveliness and food-pairing versatility. Wines with well-balanced acidity are often perceived as more vibrant and are favoured in quality assessments.

  • Rich: The descriptor rich denotes a wine’s fullness and intensity of flavour. Richness is typically associated with concentration and depth, qualities that are esteemed in premium wines.

These descriptors are integral to the language of wine evaluation, serving as benchmarks for assessing and communicating wine quality among professionals and enthusiasts alike.
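
As a complement to the word cloud, the prominence of these descriptors can be checked numerically. The following is a minimal sketch in base R, assuming a small hypothetical set of review texts (`reviews`) and illustrative stem patterns (`stems`); neither is taken from the WineMag data or from the report's own code.

```r
# Hypothetical mini-corpus standing in for the high-scoring review text
reviews <- c(
  "Rich tannins and ripe black fruit, with firm acidity.",
  "A fruity wine with fresh acidity and citrus flavours.",
  "Dense and rich, the tannins need time to soften."
)

# Illustrative stems; they also catch variants such as "tannins" or "fruity"
stems <- c(tannin = "tannin", fruit = "fruit", flavour = "flavou?r",
           acidity = "acidity", rich = "\\brich")

# Share of reviews that mention each stem at least once (case-insensitive)
mention_share <- sapply(stems, function(p) {
  mean(grepl(p, reviews, ignore.case = TRUE, perl = TRUE))
})
round(mention_share, 2)
```

Note that this counts the share of reviews mentioning a term, whereas the word cloud sizes words by their total number of occurrences, so the two views can differ when a critic repeats a word within one review.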

Relationship Between Log Price and Wine Score

The scatter plot (Figure 1) of log-transformed wine price against review score reveals a moderate-to-strong positive correlation of 0.64, suggesting that, overall, higher-priced wines tend to receive higher critic ratings.
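
To illustrate why the correlation is reported on the log scale, the sketch below compares Pearson correlations before and after the transform. The price and score vectors are hypothetical values chosen for illustration only, so the resulting numbers are not the report's results.

```r
# Hypothetical prices (USD) and scores; not taken from the WineMag data
price_USD <- c(8, 10, 12, 15, 20, 25, 40, 60, 90, 150)
score     <- c(84, 85, 86, 86, 88, 89, 90, 91, 93, 94)

r_raw <- cor(price_USD, score)        # Pearson r on the raw price scale
r_log <- cor(log(price_USD), score)   # Pearson r after the log transform

# cor.test() adds a confidence interval and p-value for the log-price correlation
test_log <- cor.test(log(price_USD), score)

c(raw = r_raw, log = r_log)
test_log$conf.int
```

When prices are right-skewed, a few expensive bottles dominate the raw-scale correlation; taking logs spreads the cheaper wines out and usually makes the relationship with score closer to linear.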

Code
# ----------------------------------------------------------
# Scatter Plot of Log(Price USD) vs Score by Wine Colour
# ----------------------------------------------------------

ggplot(wine_reviews_portugal_clean, aes(x = score, y = log(price_USD))) +
  
  # Light red points for red wine
  geom_jitter(
    data = subset(wine_reviews_portugal_clean, wine_colour == "Red"),
    colour = "#E57373", alpha = 0.5, width = 0.3, height = 0, size = 1.8
  ) +
  
  # Light blue points for white wine
  geom_jitter(
    data = subset(wine_reviews_portugal_clean, wine_colour == "White"),
    colour = "#64B5F6", alpha = 0.5, width = 0.3, height = 0, size = 1.8
  ) +
  
  # Regression line for red wine
  geom_smooth(
    data = subset(wine_reviews_portugal_clean, wine_colour == "Red"),
    aes(colour = "Red"),
    method = "lm", se = FALSE, linewidth = 1
  ) +
  
  # Regression line for white wine
  geom_smooth(
    data = subset(wine_reviews_portugal_clean, wine_colour == "White"),
    aes(colour = "White"),
    method = "lm", se = FALSE, linewidth = 1
  ) +

  # Define line colours for legend
  scale_colour_manual(
    name = "Wine Colour",
    values = c("Red" = "firebrick", "White" = "dodgerblue4")
  ) +

  labs(
    title = "Log(Price USD) vs. Wine Review Score by Colour",
    x = "Review Score",
    y = "Log(Price USD)",
    caption =  "Figure 1: Regression analysis showing the relationship between critic review \n     scores and the logarithm of wine prices.\nNote: Log scale reduces skew and highlights proportional differences."
  ) +

  theme_minimal(base_size = 12) +  # reduce base text size
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    legend.title = element_text(size = 11),
    legend.text = element_text(size = 10),
    legend.background = element_rect(fill = "transparent", colour = NA),
    legend.key = element_rect(fill = "transparent", colour = NA),
    plot.caption = element_text(size = 10, colour = "gray30", hjust = 0)
  )

When segmented by wine colour, distinct patterns emerge:

  • Red wines not only make up the majority of the dataset but also span a wider and higher price range. The regression line for red wines is steeper, meaning that each additional review point corresponds to a larger proportional increase in price, which suggests that expensive red wines are more likely to achieve higher ratings.

  • White wines, on the other hand, are generally less expensive and receive slightly lower scores on average. The relationship between price and score is weaker among white wines, with fewer examples of high-priced or top-scoring bottles.

The median prices in USD further reflect this distinction, with red wines priced at $19 and white wines at $13. These differences suggest potential consumer expectations and market dynamics that favour premiumisation within the red wine category.
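
The colour-level comparison above (medians and regression slopes) can be sketched as follows. This is a minimal base R illustration on a hypothetical data frame (`toy`); the values are invented and do not reproduce the medians reported here.

```r
# Hypothetical reviews: colour, critic score, and price in USD (illustrative values only)
toy <- data.frame(
  wine_colour = rep(c("Red", "White"), each = 5),
  score       = c(85, 87, 89, 91, 93, 84, 86, 87, 88, 90),
  price_USD   = c(10, 15, 22, 35, 60, 9, 11, 13, 15, 20)
)

# Median price per colour
median_price <- tapply(toy$price_USD, toy$wine_colour, median)

# Separate least-squares slope of log(price) on score for each colour
slopes <- sapply(split(toy, toy$wine_colour), function(d) {
  coef(lm(log(price_USD) ~ score, data = d))[["score"]]
})

median_price
slopes
```

A slope describes how much log price changes per review point; how tightly the points follow the line is a separate question, answered by the correlation within each colour.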

Several factors contribute to red wines typically commanding higher prices compared to white wines:

  • Production Techniques: Red wines often undergo more complex production processes, including extended maceration and aging in oak barrels, which add to the cost. Oak barrels are expensive and need to be replaced periodically, and the aging process ties up capital as the wine cannot be sold until it has matured. In contrast, white wines are typically fermented without skins and often aged in stainless steel tanks, which are less costly and allow for quicker turnaround from production to sale (Roy, 2024).

  • Grape Varieties and Cultivation: The grape varieties used for red wines, such as Cabernet Sauvignon and Pinot Noir, are often more sensitive to climate and soil conditions, making them more expensive to cultivate. Additionally, high-quality red wine grapes are frequently handpicked to ensure only the best grapes are used, increasing labour costs. White wine grapes, like Chardonnay and Sauvignon Blanc, generally have higher yields and can be machine-harvested, reducing production costs (Roy, 2024).

  • Market Demand and Perception: There is a historical perception that red wines are more complex and better suited for aging, which can lead to a willingness among consumers to pay higher prices. This perception, along with higher demand for premium red wines, influences their market price (Roy, 2024).

These factors collectively contribute to the generally higher cost of red wines compared to white wines.

Code
# ----------------------------------------------------------
# Summary Statistics by Wine Colour
# ----------------------------------------------------------

summary_stats <- wine_reviews_portugal_clean %>%
  group_by(wine_colour) %>%
  summarise(
    count = n(),
    mean_score = mean(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    IQR_score = IQR(score, na.rm = TRUE),
    mean_price_USD = mean(price_USD, na.rm = TRUE),
    sd_price_USD = sd(price_USD, na.rm = TRUE),
    IQR_price_USD = IQR(price_USD, na.rm = TRUE)
  )
Code
# ----------------------------------------------------------
# Convert Quality Columns for Modeling and Plotting
# ----------------------------------------------------------
# Convert quality to an ordered factor for ordinal modeling
vinho_verde_data$quality <- factor(vinho_verde_data$quality, ordered = TRUE)
# Create a numeric version of quality for regression evaluation
vinho_verde_data$quality_num <- as.numeric(as.character(vinho_verde_data$quality))
# Create a factor version for plotting purposes
vinho_verde_data$quality_factor <- factor(as.character(vinho_verde_data$quality))
# Ensure wine_colour is a factor with consistent level order
vinho_verde_data$wine_colour <- factor(vinho_verde_data$wine_colour, levels = c("Red", "White"))

Distribution of Wine Scores by Wine Colour

The following section examines the distribution of both critic review scores and objective quality ratings across red and white wines in the Portuguese wine datasets.

In the WineMag review dataset, substantially more red wines (n = 3,111) were reviewed than white wines (n = 1,065). This disparity may reflect a broader market presence of red wines or a greater interest from critics in reviewing red varietals.

In terms of scoring, red wines show a higher median review score of 89, whereas white wines have a median score of 87. This suggests that, on average, red wines may be perceived more favourably by critics within this dataset.

In the UCI Vinho Verde quality dataset, the sample distribution is reversed: there are more white wines (n = 4,898) than red wines (n = 1,599). This difference could be attributed to the production focus within the Vinho Verde region, where white wines are more commonly produced and studied.

When examining the distribution of quality ratings (objective assessments based on physicochemical properties), the majority of wines, regardless of colour, receive scores between 5 and 7, with very few wines rated at the extremes of the scale (i.e., scores of 3 or 9). This central clustering suggests a relatively narrow evaluation range for physicochemical-based quality, compared to the broader and potentially more subjective distribution seen in critic reviews.

These patterns are visually summarised in the two distribution plots below:

  • Figure 2 presents the distribution of review scores (subjective ratings from critics).

  • Figure 3 visualises the quality ratings (objective chemical assessments) from the UCI dataset.

Both use side-by-side bars to clearly compare the frequency of scores between red and white wines.

Code
# ----------------------------------------------------------
# Distribution of Wine Review Scores by Wine Colour
# ----------------------------------------------------------

# Bin the review scores and count frequency by wine_colour
ggplot(wine_reviews_portugal_clean, aes(x = factor(score), fill = wine_colour)) +
  geom_bar(position = "dodge") +  # Side-by-side bars for Red and White
  labs(
    title = "Wine Review Scores Distribution by Wine Colour",       # Plot title
    x = "Review Score",                                              # X-axis label
    y = "Number of Wines",                                           # Y-axis label
    fill = "Wine Colour",                                            # Legend title
    caption = "Figure 2: Distribution of critic review scores across red and white Portuguese wines."  # Caption
  ) +
  scale_fill_manual(values = c("Red" = "darkred", "White" = "goldenrod")) +  # Custom fill colours
  theme_minimal(base_size = 14) +  # Minimal clean theme with readable font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Centre and bold title
    axis.text = element_text(color = "black"),              # Black axis labels
    axis.title = element_text(color = "black"),             # Black axis titles
    legend.title = element_text(face = "bold"),             # Bold legend title
    plot.caption = element_text(size = 10, colour = "gray30", hjust = 0)  # Styled caption
  )

Code
# ----------------------------------------------------------
# Distribution of Wine Quality Ratings by Wine Colour
# ----------------------------------------------------------

ggplot(vinho_verde_data, aes(x = quality_factor, fill = wine_colour)) +
  geom_bar(position = "dodge") +  # Side-by-side bars for each wine colour
  labs(
    title = "Wine Quality Ratings Distribution by Wine Colour",        # Plot title
    x = "Quality Rating",                                              # X-axis label
    y = "Number of Wines",                                             # Y-axis label
    fill = "Wine Colour",                                              # Legend title
    caption = "Figure 3: Distribution of wine quality scores by wine colour (Vinho Verde dataset)."  # Caption
  ) +
  scale_fill_manual(values = c("Red" = "darkred", "White" = "goldenrod")) +  # Custom colours
  theme_minimal(base_size = 14) +  # Clean theme with readable font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Centred and bolded title
    axis.text = element_text(color = "black"),              # Axis label colour
    axis.title = element_text(color = "black"),             # Axis title colour
    legend.title = element_text(face = "bold"),             # Bold legend
    plot.caption = element_text(size = 10, colour = "gray30", hjust = 0)  # Styled caption
  )

Comparison of Standardised Quality and Review Scores by Wine Colour

To better understand how objective wine quality scores compare with subjective critic reviews, both datasets were standardised using z-scores and visualised using smoothed density plots, separated by wine colour. This approach allows for a direct comparison between the UCI wine quality ratings and the Portuguese wine review scores on a consistent scale.

Figure 4 presents these density plots, with one panel per wine colour; within each panel, the two curves show the standardised UCI quality ratings and the standardised review scores.

From the plots, a few key insights emerge:

  • For red wines, the distribution of review scores is shifted to the right relative to the quality scores. Because each dataset is standardised against its own overall mean, this indicates that red wines sit higher within the critic review distribution than they do within the UCI quality distribution. This may reflect a bias in consumer-facing reviews toward red wines or a preference for characteristics more common in reds.

  • For white wines, the pattern is reversed. The standardised quality scores from the UCI dataset appear higher, on average, than the standardised review scores. This suggests that, while white wines may perform better on objective quality criteria (such as chemical properties), they are rated slightly lower in subjective critic reviews.

These differences may also stem from dataset imbalance. The wine review dataset contains more red wines, while the quality dataset includes more white wines, potentially influencing the score distributions. Additionally, the broader market exposure and pricing diversity of red wines may contribute to their higher critic scoring trends.

Overall, this standardised comparison highlights the importance of evaluating both objective quality and subjective perception in understanding wine excellence across different varieties.
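
The standardisation step can be sketched as follows. This is a minimal base R example on a hypothetical data frame (`toy`), showing that a z-score computed over a whole dataset places each colour relative to that dataset's overall mean.

```r
# Hypothetical scores from one dataset, with wine colour (illustrative values only)
toy <- data.frame(
  colour = c("Red", "Red", "Red", "White", "White", "White"),
  score  = c(88, 90, 92, 84, 86, 88)
)

# z-score over the whole dataset: subtract the overall mean, divide by the overall SD
toy$score_z <- (toy$score - mean(toy$score)) / sd(toy$score)

# scale() returns the same values as a one-column matrix
all.equal(toy$score_z, scale(toy$score)[, 1])

# Mean z-score per colour: a positive value means above the dataset average
z_by_colour <- tapply(toy$score_z, toy$colour, mean)
z_by_colour
```

Because the z-scores are centred on the combined mean, a colour that makes up most of a dataset pulls that mean towards itself, which is one way dataset imbalance can influence the comparison.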

Code
# ----------------------------------------------------------
# Smoothed Density Plot: Scaled Quality vs Review Score by Wine Colour
# ----------------------------------------------------------

# Create standardised scores (z-scores) for both datasets and combine into one dataframe
scaled_data <- bind_rows(
  # Standardise the UCI Vinho Verde wine quality scores
  vinho_verde_data %>%
    transmute(
      Colour = wine_colour,
      Score_z = scale(quality_num)[,1],  # z-score for wine quality
      Source = "Quality"
    ),
  # Standardise the wine review scores from the Portuguese reviews dataset
  wine_reviews_portugal_clean %>%
    transmute(
      Colour = wine_colour,
      Score_z = scale(score)[,1],  # z-score for review score
      Source = "Review Score"
    )
) %>%
  filter(Colour %in% c("Red", "White"))  # Only include Red and White wines

# Generate a palette with visually distinct colours, keeping one colour per data source
source_levels <- unique(scaled_data$Source)
colors <- createPalette(10, c("#0D58FF", "#4B7A00"))[seq_along(source_levels)]

# Display the colours and swatches (for previewing)
#list(colors)
#swatch(colors)

# Assign names to the colour vector for use in scale_fill_manual
names(colors) <- source_levels

# Create density plots of standardised scores, grouped by data source and faceted by wine colour
ggplot(scaled_data, aes(x = Score_z, fill = Source)) +
  geom_density(alpha = 0.6, bw = 0.4, color = NA) +  # Smoothed density curves with transparency
  facet_wrap(~ Colour, scales = "free") +  # Separate plots for Red and White wine
  labs(
    title = "Standardised Distributions of Quality and Review Score by Wine Colour",
    x = "Standardised Score (Z)",
    y = "Density",
    fill = "Source",
    caption = "Figure 4: Smoothed density curves comparing UCI quality scores and review scores by wine colour."
  ) +
  scale_fill_manual(values = colors) +  # Apply custom colour scheme
  theme_minimal() +
  theme(
    legend.position = "top",
    plot.title.position = "plot",
    plot.margin = grid::unit(c(20, 10, 10, 10), "pt"),  # Add spacing around the plot
    plot.caption = element_text(size = 10, colour = "gray30", hjust = 0)
  )

Code
# ----------------------------------------------------------
# Filter Out Price Outliers (Above 95th Percentile) by Wine Colour
# ----------------------------------------------------------

wine_reviews_portugal_filtered <- wine_reviews_portugal_clean %>%
  group_by(wine_colour) %>%
  filter(price_USD <= quantile(price_USD, 0.95, na.rm = TRUE)) %>%
  ungroup()

Predictive Modelling of Wine Quality

Building on the exploratory analyses of wine characteristics and review patterns, the next section focuses on predictive modelling to identify the key variables that influence wine quality. Using the UCI Wine Quality dataset as the foundation, several statistical and machine learning techniques were applied to model the relationship between physicochemical properties and expert-assigned quality scores.

This modelling process not only highlights the most influential predictors of wine quality but also allows for the evaluation of different modelling approaches, including linear regression, random forest, and ordinal forest models. Through this, we aim to determine which attributes most strongly contribute to a wine’s perceived quality and assess the overall predictive performance of each model.

Correlation Matrix of Physicochemical Features

The correlation matrix below presents the pairwise relationships between all numeric variables in the Vinho Verde wine dataset, including the expert-assigned quality score. The matrix serves as an important exploratory tool to identify potential linear associations between the wine’s physicochemical properties—such as alcohol, acidity, sulphates, and pH—and its rated quality.

Each cell in the matrix displays the Pearson correlation coefficient between two variables, which ranges from -1 to +1:

  • A positive value indicates a direct relationship (as one variable increases, so does the other).

  • A negative value indicates an inverse relationship (as one increases, the other decreases).

  • A value near 0 implies little or no linear relationship.

The colours in the matrix represent the strength and direction of the correlations:

  • Dark blue indicates strong positive correlations.

  • Dark red indicates strong negative correlations.

  • Lighter shades (closer to white) indicate weak or no correlation.

This matrix provides early insights into which variables may be important predictors of wine quality and informs feature selection in the modelling phase. For instance, if alcohol shows a strong positive correlation with quality, it may be expected to have a significant role in predictive modelling.
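
As a minimal illustration of how the sign and size of a Pearson coefficient are read, the sketch below uses a small made-up data frame in which one variable rises with quality and another falls. The values and the name toy are hypothetical and are not taken from the Vinho Verde data.

```r
# Hypothetical toy values, for illustration only (not from the UCI data)
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.2, 11.0, 12.1, 12.8),
  density = c(0.9990, 0.9982, 0.9971, 0.9958, 0.9940, 0.9931),
  quality = c(4, 5, 5, 6, 7, 7)
)

# Pairwise Pearson correlations, the same statistic shown in each matrix cell
toy_cor <- cor(toy, method = "pearson")
print(round(toy_cor, 2))

# The sign gives the direction: positive for alcohol, negative for density
sign(toy_cor["alcohol", "quality"])
sign(toy_cor["density", "quality"])
```

The diagonal is always 1 because every variable is perfectly correlated with itself, so only the off-diagonal cells carry information.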

Code
# ----------------------------------------------------------
# Correlation Matrix of Numeric Features
# ----------------------------------------------------------
# Extract numeric features and rename quality for correlation analysis
# (where() replaces the superseded select_if())
numeric_features <- vinho_verde_data %>% 
  select(where(is.numeric)) %>% 
  rename(Quality = quality_num)

# Save current graphics settings
old_par <- par(no.readonly = TRUE)

# Set outer margin to allow space for the title
par(oma = c(0, 0, 3, 0))  # top margin = 3 lines

# Format column names nicely
colnames(numeric_features) <- str_to_title(gsub("_", " ", colnames(numeric_features)))

# Plot the correlation matrix with black labels
corrplot(
  cor(numeric_features),
  method = "color",
  tl.cex = 0.8,
  tl.col = "black"
)

# Add a centred heading at the top
mtext("Figure 5 - Correlation Matrix of Wine Data", outer = TRUE, cex = 1.5, col = "black", side = 3, line = 1, adj = 0.5)

Code
# Restore original graphics settings
par(old_par)
Code
# ----------------------------------------------------------
# Train-Test Split for Modeling
# ----------------------------------------------------------
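# Generate random index for 80% training split
# (call set.seed() beforehand so the split, and the metrics reported below, can be reproduced)
vinho_verde_index <- sample(seq_len(nrow(vinho_verde_data)), size = floor(0.8 * nrow(vinho_verde_data)))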
# Create training set (80% of data)
vinho_verde_train_set <- vinho_verde_data[vinho_verde_index, ]
# Create test set (remaining 20% of data)
vinho_verde_test_set <- vinho_verde_data[-vinho_verde_index, ]

# Subset training set for numeric modeling
vinho_verde_train_numeric <- vinho_verde_train_set %>% dplyr::select(quality, fixed_acidity:alcohol)
# Subset test set for numeric modeling
vinho_verde_test_numeric <- vinho_verde_test_set %>% dplyr::select(quality, fixed_acidity:alcohol)

# Store ordered quality levels for consistency
quality_levels <- levels(vinho_verde_data$quality)
# Apply ordered factor to training set
vinho_verde_train_numeric$quality <- factor(vinho_verde_train_numeric$quality, levels = quality_levels, ordered = TRUE)
# Apply ordered factor to test set
vinho_verde_test_numeric$quality <- factor(vinho_verde_test_numeric$quality, levels = quality_levels, ordered = TRUE)
Code
# ----------------------------------------------------------
# Fit Ordinal Forest Model
# ----------------------------------------------------------
# Determine mtry parameter for ordinal forest
mtry_val <- floor(sqrt(ncol(vinho_verde_train_numeric) - 1))
of_model <- ordfor(depvar = "quality", data = as.data.frame(vinho_verde_train_numeric),
                   nsets = 100, ntreeperdiv = 100, ntreefinal = 500, mtry = mtry_val)

# Generate ordinal forest predictions on the test set (ypred holds the predicted classes)
preds <- predict(of_model, newdata = as.data.frame(vinho_verde_test_numeric))$ypred
# Recode predicted class indices to original quality levels
preds_recoded <- factor(quality_levels[as.numeric(preds)], ordered = TRUE, levels = quality_levels)
Code
# ----------------------------------------------------------
# Model Evaluation
# ----------------------------------------------------------
# Create a confusion matrix to evaluate classification accuracy
conf_matrix <- table(Predicted = preds_recoded, Actual = vinho_verde_test_numeric$quality)


# Calculate overall prediction accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
# Compute weighted Kappa to assess ordinal agreement
kappa <- kappa2(data.frame(Predicted = preds_recoded, Actual = vinho_verde_test_numeric$quality), weight = "squared")

# Convert actual quality values to numeric for RMSE and MAE
actual_numeric <- as.numeric(as.character(vinho_verde_test_numeric$quality))
# Convert predicted quality values to numeric for RMSE and MAE
predicted_numeric <- as.numeric(as.character(preds_recoded))
# Calculate Root Mean Squared Error
rmse <- sqrt(mean((predicted_numeric - actual_numeric)^2))
# Calculate Mean Absolute Error
mae <- mean(abs(predicted_numeric - actual_numeric))

#cat("Accuracy:", accuracy, "\n")
#print(kappa)
#cat("RMSE:", rmse, "\n")
#cat("MAE:", mae, "\n")

Modelling Wine Quality Using Ordinal Forest

To model and predict wine quality, the dataset was first divided into training and testing sets, with 80% of the data used for training and 20% reserved for testing. This split ensures that the model is trained on one subset of the data and then evaluated on unseen observations to assess its generalisability and predictive performance.
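
The sketch below shows how an 80/20 split of this kind can be made repeatable. It assumes the combined UCI red and white files contain 6,497 rows, and the seed value is hypothetical; neither is stated in the analysis code itself.

```r
# Assumption: 6,497 rows, the combined size of the UCI red and white files
n_wines <- 6497

set.seed(2025)  # hypothetical seed; any fixed value makes the split repeatable
train_index <- sample(seq_len(n_wines), size = floor(0.8 * n_wines))
test_index <- setdiff(seq_len(n_wines), train_index)

length(train_index)  # 5197 training rows
length(test_index)   # 1300 test rows
```

Under that assumption the test set holds 1,300 wines, which is the total number of observations in the confusion matrix (Figure 6).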

Why Use the Ordinal Forest Method?

Wine quality ratings are ordinal in nature—they follow a ranked scale (typically from 3 to 9), but the distances between the levels are not necessarily equal. Traditional regression or classification models might ignore this structure by treating the outcome either as continuous or nominal.

The Ordinal Forest algorithm is specifically designed for ordered categorical outcomes. It combines the robustness of random forest modelling with the ability to respect the ordinal structure of the dependent variable. This makes it particularly well-suited for predicting wine quality ratings, which are discrete and ranked.
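
To make the distinction between ordinal and nominal outcomes concrete, the following sketch stores a few example ratings as an ordered factor, as the modelling code does, and as a plain factor. The example ratings are hypothetical.

```r
# Quality scores stored as an ordered factor, as in the modelling code
quality_levels <- as.character(3:9)
ratings <- factor(c("5", "7", "6", "3", "8"), levels = quality_levels, ordered = TRUE)

# Order comparisons are meaningful for ordered factors ...
ratings > "5"
# ... and the integer codes preserve rank (3 -> 1, 4 -> 2, ..., 9 -> 7)
as.integer(ratings)

# The same values as an unordered (nominal) factor carry no ranking
nominal <- factor(c("5", "7", "6", "3", "8"), levels = quality_levels)
is.ordered(ratings)
is.ordered(nominal)
```

A nominal classifier would treat a prediction of 3 for a wine rated 8 as no worse than a prediction of 7, whereas an ordinal method can use the ranking.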

Outcomes and Model Evaluation

After fitting the ordinal forest model, predictions were made on the test set and evaluated using several performance metrics:

  • Accuracy: The model correctly predicted the exact quality rating in 68.7% of cases.
  • Weighted Cohen’s Kappa: A kappa value of 0.676 indicates substantial agreement between predicted and actual ratings.
  • RMSE (Root Mean Squared Error): 0.641. This is the square root of the average squared difference between predicted and true ratings, so larger errors are penalised more heavily than in the MAE.
  • MAE (Mean Absolute Error): 0.345, meaning predictions were on average within a third of a point from the actual score.

The confusion matrix (Figure 6) shows that the model performs best on the most common scores (5 and 6) and reasonably well on 7, but rarely predicts extreme values correctly. None of the four wines rated 3 in the test set was classified correctly, and the test set contains no wines rated 9.

Code
cat("Figure 6 - Confusion Matrix", "\n\n")
Figure 6 - Confusion Matrix 
Code
conf_matrix
         Actual
Predicted   3   4   5   6   7   8   9
        3   0   0   1   0   0   0   0
        4   2   5   1   1   0   0   0
        5   1  27 333  94   8   0   0
        6   1  12 104 421  96  10   0
        7   0   1   2  31 122  12   0
        8   0   0   0   2   1  12   0
        9   0   0   0   0   0   0   0
Code
#cat("MAE:", mae, "\n")
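
As a consistency check, the four reported metrics can be re-derived from the confusion matrix in Figure 6 alone. The sketch below re-enters the printed matrix by hand; the names scores, gap, expected and kappa_w are illustrative and do not appear in the analysis code.

```r
# Confusion matrix as printed in Figure 6 (rows = predicted, columns = actual)
scores <- 3:9
conf_matrix <- matrix(
  c(0,  0,   1,   0,   0,  0, 0,
    2,  5,   1,   1,   0,  0, 0,
    1, 27, 333,  94,   8,  0, 0,
    1, 12, 104, 421,  96, 10, 0,
    0,  1,   2,  31, 122, 12, 0,
    0,  0,   0,   2,   1, 12, 0,
    0,  0,   0,   0,   0,  0, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(Predicted = as.character(scores), Actual = as.character(scores))
)
n <- sum(conf_matrix)

# gap[i, j] = predicted score minus actual score for cell (i, j)
gap <- outer(scores, scores, "-")

accuracy <- sum(diag(conf_matrix)) / n
mae  <- sum(abs(gap) * conf_matrix) / n
rmse <- sqrt(sum(gap^2 * conf_matrix) / n)

# Quadratic-weighted kappa: 1 - observed / chance-expected squared disagreement
expected <- outer(rowSums(conf_matrix), colSums(conf_matrix)) / n
kappa_w <- 1 - sum(gap^2 * conf_matrix) / sum(gap^2 * expected)

round(c(accuracy = accuracy, kappa = kappa_w, rmse = rmse, mae = mae), 3)
```

Rounded to three decimals, this gives 0.687, 0.676, 0.641 and 0.345, which agree with the values reported above.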

Visualising Model Performance

The grouped bar plot (Figure 7) below compares the distribution of actual vs predicted wine quality ratings. The overall shape is preserved, but the model over-predicts the most common ratings (463 predictions of 5 and 644 of 6, against 441 and 549 actual) and under-predicts 7 (168 against 227). Rare classes are almost never predicted: 3 was predicted once and 9 not at all, reflecting their scarcity in the training data, a common consequence of class imbalance.

This modelling approach demonstrates that physicochemical attributes can explain a meaningful portion of variance in wine quality ratings, and ordinal forest provides a strong foundation for accurate prediction in this domain.
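
The per-class pattern can be summarised numerically from the row totals, column totals and diagonal of the Figure 6 confusion matrix. The sketch below is one way to tabulate them; the column names over_prediction and recall are illustrative choices, not part of the original analysis.

```r
quality <- 3:9
# Marginal totals and diagonal of the Figure 6 confusion matrix
actual_n    <- c(4, 45, 441, 549, 227, 34, 0)
predicted_n <- c(1, 9, 463, 644, 168, 15, 0)
correct_n   <- c(0, 5, 333, 421, 122, 12, 0)

class_summary <- data.frame(
  quality = quality,
  actual = actual_n,
  predicted = predicted_n,
  # Positive values mean the class is predicted more often than it occurs
  over_prediction = predicted_n - actual_n,
  # Share of wines in each actual class that were classified correctly
  recall = ifelse(actual_n > 0, correct_n / actual_n, NA)
)
print(class_summary, digits = 3)
```

Recall is left as NA for a rating of 9 because no wine with that rating appears in the test set.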

Code
# ----------------------------------------------------------
# Plot: Actual vs Predicted Wine Quality
# ----------------------------------------------------------

# Create dataframe containing actual wine quality ratings from the test set
actual_df <- data.frame(WineQuality = vinho_verde_test_numeric$quality, Source = "Actual")

# Create dataframe containing predicted wine quality ratings
pred_df <- data.frame(WineQuality = preds_recoded, Source = "Predicted")

# Combine the actual and predicted results into a single dataset
combined <- rbind(actual_df, pred_df)

# Count the number of wines for each quality rating by source (Actual vs Predicted)
combined <- combined %>% dplyr::count(WineQuality, Source)

# Generate a grouped bar chart to visually compare actual and predicted wine quality distributions
ggplot(combined, aes(x = WineQuality, y = n, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Actual vs Predicted Wine Quality",
    x = "Quality Rating",
    y = "Count",
    caption = "Figure 7: Comparison of actual and predicted quality ratings."
  ) +
  theme_minimal() +  # Clean theme with readable font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Centred and bolded title
    axis.text = element_text(color = "black"),              # Axis label colour
    axis.title = element_text(color = "black"),             # Axis title colour
    legend.title = element_text(face = "bold"),             # Bold legend
    plot.caption = element_text(size = 10, colour = "gray30", hjust = 0)  # Styled caption
  )

Variable Importance from the Ordinal Forest Model

The plot (Figure 8) below ranks the predictor variables by their relative importance in the ordinal forest model used to predict wine quality. The ordinalForest package reports a permutation-based importance: a variable scores highly when randomly shuffling its values noticeably worsens the model’s prediction error, which indicates that the model relies on it.

These importance values are scaled relative measures, helping us understand which physicochemical features have the strongest impact on model performance. A higher value indicates greater influence in predicting wine quality, but the values should be interpreted comparatively rather than absolutely.
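
One common way to quantify a predictor's contribution is permutation importance: shuffle one predictor and record how much the prediction error grows. The sketch below illustrates the idea on simulated data with a linear model rather than an ordinal forest, and it measures error on the training data for brevity. All names, the seed and the data are hypothetical, and it is not a reproduction of how ordfor computes its scores.

```r
set.seed(1)  # hypothetical seed for a repeatable toy example

# Simulated data: y depends strongly on x_signal and not at all on x_noise
n <- 300
toy <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
toy$y <- 2 * toy$x_signal + rnorm(n, sd = 0.5)

fit <- lm(y ~ x_signal + x_noise, data = toy)
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
baseline <- mse(fit, toy)

# Importance = increase in error after shuffling one predictor's values
permutation_importance <- sapply(c("x_signal", "x_noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - baseline
})
print(permutation_importance)
```

The informative predictor should show a large increase in error and the noise predictor a value close to zero, which is why such scores are best read as a ranking rather than as absolute quantities.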

Several observations stand out:

  • Alcohol is by far the most influential predictor, consistent with our earlier correlation matrix (Figure 5), where alcohol showed a strong positive correlation with quality (indicated by a dark blue colour). This suggests that higher alcohol levels are generally associated with higher-rated wines in this dataset.

  • Density also ranked highly in importance, but in contrast, it showed a negative correlation with quality (represented by a medium red shade in the correlation matrix). This implies that higher density may be associated with lower quality, yet it remains a useful variable in differentiating wine types.

  • Alcohol and density are also strongly negatively correlated with each other (deep red cell in the matrix): as alcohol content increases, density tends to decrease. This follows from fermentation chemistry. As sugar in the grape must is converted into ethanol, the dissolved sugar that raises density is consumed, and the ethanol produced is less dense than water, so wines with higher alcohol tend to be less dense. The relationship is well established in enology and is routinely used in quality control and fermentation monitoring (Jackson, 2020). Both variables can still be important in the model, since each may carry some signal about quality that the other does not.

  • Other variables such as volatile acidity, chlorides, and free sulfur dioxide also contribute meaningfully to the model, while predictors like fixed acidity, pH, and sulphates appear to have less influence.

By combining insights from both the correlation matrix and the ordinal forest model, we gain a more comprehensive understanding: features that are both strongly correlated with quality and structurally useful in the tree-based model tend to emerge as top predictors. Meanwhile, the model’s ability to capture interactions and non-linear effects ensures that variables like density still hold value even if their relationship with quality is inverse or complex.
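
The direction of the alcohol and density relationship can be illustrated with a rough calculation. The sketch below assumes an ideal water and ethanol mixture with approximate densities at 20 °C, ignoring residual sugar, other dissolved solids and the slight volume contraction that occurs on mixing, so the absolute values will not match measured wine densities. The function name mixture_density is hypothetical.

```r
# Approximate densities at 20 degrees Celsius, in g/mL
density_water   <- 0.998
density_ethanol <- 0.789

# Simplifying assumption: volumes add ideally and no sugar or other solutes remain
mixture_density <- function(abv) {
  abv * density_ethanol + (1 - abv) * density_water
}

abv <- c(0.09, 0.11, 0.13)  # 9%, 11% and 13% alcohol by volume
round(mixture_density(abv), 4)
```

Even in this simplified form, each additional percentage point of alcohol lowers the estimated density, which matches the sign of the correlation in Figure 5.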

Code
# ----------------------------------------------------------
# Variable Importance Plot
# ----------------------------------------------------------

# Create a dataframe from the variable importance vector of the ordinal forest model
varimp_df <- data.frame(Predictor = names(of_model$varimp), Importance = of_model$varimp)

# Clean and format predictor names (e.g., replace underscores with spaces and capitalise for readability)
varimp_df$Predictor <- str_to_title(gsub("_", " ", varimp_df$Predictor))

# Sort the dataframe in ascending order of importance for clearer visual ranking
varimp_df <- varimp_df %>% arrange(Importance)

# Convert Predictor column to factor to maintain the sorted order in the plot
varimp_df$Predictor <- factor(varimp_df$Predictor, levels = varimp_df$Predictor)

# Create a horizontal bar chart showing the importance of each predictor variable
ggplot(varimp_df, aes(x = Predictor, y = Importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Variable Importance (Ordinal Forest)",
    x = "", y = "Importance",
    caption = "Figure 8: Importance of variables in predicting wine quality"
  ) +
  theme_minimal() +  # Clean theme with readable font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Centred and bolded title
    axis.text = element_text(color = "black"),              # Axis label colour
    axis.title = element_text(color = "black"),             # Axis title colour
    legend.title = element_text(face = "bold"),             # Bold legend
    plot.caption = element_text(size = 10, colour = "gray30", hjust = 0)  # Styled caption
  )


Summary

This report presents a comprehensive analysis of Portuguese wines using two complementary datasets: the UCI Wine Quality dataset, which contains physicochemical properties and expert-rated quality scores, and the WineMag (Kaggle) critic reviews dataset, which includes scores, descriptions, and prices. Through a combination of exploratory data analysis, statistical testing, machine learning modelling, and natural language processing (NLP), the report investigates the factors that influence wine quality and perception.

Key Findings

  • Review vs. Quality Scores: Red wines received higher median review scores (89) compared to white wines (87), while in the UCI quality dataset, white wines were more prevalent and scored slightly lower on average. The majority of wines—both red and white—were rated between 5 and 7, with very few wines receiving extreme scores (3 or 9), indicating a central tendency in objective quality assessment.

  • Price and Score Relationship: A moderate-to-strong positive correlation (r = 0.64) was found between log-transformed wine price and review score. Red wines showed a steeper relationship, suggesting that higher-priced red wines tend to score better, whereas white wines had lower prices and generally more modest scores.

  • Standardised Comparison of Scores: When standardising both quality scores and review scores, red wines tended to score higher in critic reviews than in quality assessments, while white wines scored higher in quality assessments than in critic reviews. This highlights potential differences in subjective perception versus objective measurements across wine types.

  • NLP Analysis of Descriptions: Word clouds and logistic regression applied to reviews scoring over 90 revealed that terms such as tannin, fruit, flavours, acidity, and rich were strongly associated with high-quality wines. These terms reflect common tasting descriptors that align with both chemical properties and consumer appeal (Jackson, 2020).

  • Correlation Insights: The correlation matrix revealed a strong positive correlation between alcohol and quality, and a negative correlation between density and quality. Additionally, alcohol and density themselves were strongly negatively correlated, which is consistent with wine chemistry—alcohol lowers density during fermentation (Jackson, 2020).

  • Predictive Modelling with Ordinal Forest:

    • An ordinal forest model was trained to predict wine quality using 80% of the data and evaluated on the remaining 20%.
    • The model achieved an accuracy of 68.7%, a Cohen’s weighted Kappa of 0.676, RMSE of 0.641, and MAE of 0.345, indicating substantial agreement between predicted and actual quality levels.
    • The variable importance plot showed that alcohol was by far the most influential predictor, followed by density and volatile acidity. This aligns with the correlation analysis and confirms the significance of these variables in driving wine quality.

Conclusion

This project demonstrates how structured and unstructured data can be integrated to gain meaningful insights into wine quality. By combining statistical methods, machine learning, and natural language analysis, we identified key chemical and linguistic indicators of quality and developed a predictive model that assigned the exact quality rating to 68.7% of held-out wines.

The findings are particularly relevant for industry stakeholders like Treasury Wine Estates (TWE), who can benefit from the modelling framework to optimise production, marketing, and pricing strategies. The accompanying interactive Shiny app further enables users to explore and predict wine quality dynamically, making the results accessible and actionable for broader audiences.


References

Dua, D., & Graff, C. (2017). Wine Quality Data Set. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from https://archive.ics.uci.edu/ml/datasets/wine+quality

Jackson, R. S. (2020). Wine tasting: A professional handbook (4th ed.). Academic Press.

Jackson, R. S. (2020). Wine science: Principles and applications (5th ed.). Academic Press.

Roy, U. (2024, May 22). 4 Reasons White Wines Are Usually Cheaper Than Red Wines. Slurrp. https://www.slurrp.com/article/4-reasons-white-wines-are-usually-cheaper-than-red-wines-1716392774297

Treasury Wine Estates. (2024). Annual Report 2023: Premiumisation & Global Growth Strategy. Retrieved from https://www.tweglobal.com/investors/reports-presentations

Zynicide. (2017). Wine Reviews. Kaggle. Retrieved from https://www.kaggle.com/datasets/zynicide/wine-reviews

ChatGPT (OpenAI, 2025) was used throughout the project to assist with code formatting, improve the consistency of sentence structure and narrative flow, and support the refinement of analytical interpretations.